Citation¶

Much of the code and many of the examples are copied or adapted from:

Blueprints for Text Analytics Using Python by Jens Albrecht, Sidharth Ramachandran, and Christian Winkler (O'Reilly, 2021), 978-1-492-07408-3.

  • https://github.com/blueprints-for-text-analytics-python/blueprints-text
  • https://github.com/blueprints-for-text-analytics-python/blueprints-text/blob/master/ch01/First_Insights.ipynb

Setup¶

In [1]:
%run "/code/source/config/notebook_settings.py"
In [2]:
pd.set_option('display.max_colwidth', None)
In [3]:
from source.library.text_analysis import count_tokens, tf_idf, get_context_from_keyword, \
    count_keywords, count_keywords_by, impurity
In [4]:
with Timer("Loading Data"):
    path = 'artifacts/data/processed/reddit.pkl'
    df = pd.read_pickle(path)
2023-03-03 03:28:28 - INFO     | Timer Started: Loading Data
2023-03-03 03:28:28 - INFO     | Timer Finished: (0.22 seconds)

Exploratory Data Analysis¶

This section provides a basic exploration of the text and dataset.

Dataset Summary¶

In [5]:
df.head(1)
Out[5]:
id subreddit title post impurity post_clean all_lemmas partial_lemmas bi_grams adjs_verbs nouns noun_phrases entities post_length num_tokens language
0 74qv99 Honda J32A3 Block with J35Z2 Crank and Rods? Hello, <lb><lb>I have my J32A3 egine open, ready to put new rings and all. I would like to know if I can swap the J35Z2 crank and rods from an Accord Sedan v6 3.5. I would like to keep my pistons, just change the rods and crank to have a little more displacement. This is a doable option? there is something I need to know first? This is for my Acura TL 3G 2004 Manual Transmission. I really would like to know if it's possible and if possible what differences are between then J35Z2 and let's say the classic J35A3.<lb><lb>Thanks in advance. 0.01 Hello, I have my J32A3 egine open, ready to put new rings and all. I would like to know if I can swap the J35Z2 crank and rods from an Accord Sedan v6 _NUMBER_ . I would like to keep my pistons, just change the rods and crank to have a little more displacement. This is a doable option? there is something I need to know first? This is for my Acura TL 3G _NUMBER_ Manual Transmission. I really would like to know if it's possible and if possible what differences are between then J35Z2 and let's say the classic J35A3. Thanks in advance. [hello, i, have, my, j32a3, egine, open, ready, to, put, new, ring, and, all, i, would, like, to, know, if, i, can, swap, the, j35z2, crank, and, rod, from, an, accord, sedan, v6, _number_, i, would, like, to, keep, my, piston, just, change, the, rod, and, crank, to, have, a, little, more, displacement, this, be, a, doable, option, there, be, something, i, need, to, know, first, this, be, for, my, acura, tl, 3, g, _number_, manual, transmission, i, really, would, like, to, know, if, it, be, possible, and, if, possible, what, difference, be, between, then, j35z2, and, let, us, say, ...] 
[hello, j32a3, egine, open, ready, new, ring, like, know, swap, j35z2, crank, rod, accord, sedan, v6, _number_, like, piston, change, rod, crank, little, displacement, doable, option, need, know, acura, tl, 3, g, _number_, manual, transmission, like, know, possible, possible, difference, j35z2, let, classic, j35a3, thank, advance] [j32a3-egine, egine-open, new-ring, j35z2-crank, accord-sedan, sedan-v6, v6-_number_, doable-option, acura-tl, tl-3, 3-g, g-_number_, _number_-manual, manual-transmission, classic-j35a3] [open, ready, new, like, know, swap, like, change, little, doable, need, know, like, know, possible, possible, let, classic] [j32a3, egine, ring, j35z2, crank, rod, accord, sedan, v6, _number_, piston, rod, crank, displacement, option, acura, tl, g, _number_, manual, transmission, difference, j35z2, j35a3, thank, advance] [egine-open, new-ring, j35z2-crank, doable-option, classic-j35a3] [Accord Sedan (PRODUCT), first (ORDINAL), Acura (ORG), 3 (CARDINAL), g _NUMBER_ Manual Transmission (ORG)] 542 46 English

Numeric Summary¶

In [6]:
hlp.pandas.numeric_summary(df)
Out[6]:
             # of Non-Nulls  # of Nulls  % Nulls  # of Zeros  % Zeros   Mean  St Dev.  Coef of Var  Skewness  Kurtosis    Min    10%    25%    50%    75%      90%      Max
impurity              5,000           0     0.0%       1,023    20.0%    0.0      0.0          1.0       1.9       7.8    0.0    0.0    0.0    0.0    0.0      0.0      0.2
post_length           5,000           0     0.0%           0     0.0%  679.1    452.5          0.7       2.5       9.2  256.0  307.0  381.0  538.0  812.2  1,217.1  4,174.0
num_tokens            5,000           0     0.0%           0     0.0%   52.5     35.2          0.7       2.6      10.0   12.0   24.0   30.0   42.0   63.0     93.0    365.0

Non-Numeric¶

In [7]:
hlp.pandas.non_numeric_summary(df)
Out[7]:
                # of Non-Nulls  # of Nulls  % Nulls  Most Freq. Value                     # of Unique  % Unique
id                       5,000           0     0.0%  74qv99                                     5,000    100.0%
subreddit                5,000           0     0.0%  Lexus                                         20      0.4%
title                    5,000           0     0.0%  Need some advice                           4,995     99.9%
post                     5,000           0     0.0%  Hello, <lb><lb>I have my J32A3[...]        5,000    100.0%
post_clean               5,000           0     0.0%  Hello, I have my J32A3 egine o[...]        5,000    100.0%
all_lemmas               5,000           0     0.0%  ['hello', 'i', 'have', 'my', '[...]        5,000    100.0%
partial_lemmas           5,000           0     0.0%  ['hello', 'j32a3', 'egine', 'o[...]        5,000    100.0%
bi_grams                 5,000           0     0.0%  ['j32a3-egine', 'egine-open', [...]        5,000    100.0%
adjs_verbs               5,000           0     0.0%  ['open', 'ready', 'new', 'like[...]        5,000    100.0%
nouns                    5,000           0     0.0%  ['j32a3', 'egine', 'ring', 'j3[...]        5,000    100.0%
noun_phrases             5,000           0     0.0%  []                                         4,947     98.9%
entities                 5,000           0     0.0%  []                                         4,565     91.3%
language                 4,996           4     0.1%  English                                        1      0.0%

Examples¶

In [8]:
df['post'].iloc[0][0:1000]
Out[8]:
"Hello, <lb><lb>I have my J32A3 egine open, ready to put new rings and all. I would like to know if I can swap the J35Z2 crank and rods from an Accord Sedan v6 3.5. I would like to keep my pistons, just change the rods and crank to have a little more displacement. This is a doable option? there is something I need to know first? This is for my Acura TL 3G 2004 Manual Transmission. I really would like to know if it's possible and if possible what differences are between then J35Z2 and let's say the classic J35A3.<lb><lb>Thanks in advance."
In [9]:
'|'.join(df['partial_lemmas'].iloc[0])[0:1000]
Out[9]:
'hello|j32a3|egine|open|ready|new|ring|like|know|swap|j35z2|crank|rod|accord|sedan|v6|_number_|like|piston|change|rod|crank|little|displacement|doable|option|need|know|acura|tl|3|g|_number_|manual|transmission|like|know|possible|possible|difference|j35z2|let|classic|j35a3|thank|advance'
In [10]:
'|'.join(df['bi_grams'].iloc[0])[0:1000]
Out[10]:
'j32a3-egine|egine-open|new-ring|j35z2-crank|accord-sedan|sedan-v6|v6-_number_|doable-option|acura-tl|tl-3|3-g|g-_number_|_number_-manual|manual-transmission|classic-j35a3'
In [11]:
'|'.join(df['noun_phrases'].iloc[0])[0:1000]
Out[11]:
'egine-open|new-ring|j35z2-crank|doable-option|classic-j35a3'
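
The `bi_grams` column is precomputed, and the code that builds it is not shown in this notebook. The output above suggests that adjacent tokens are joined with `-` and that pairs containing stop words are dropped. A minimal sketch of that idea, assuming a hypothetical helper `bi_grams_sketch` and an illustrative stop-word set (neither is part of the project's library):

```python
def bi_grams_sketch(tokens, stop_words=frozenset(), sep='-'):
    """Join each pair of adjacent tokens, skipping pairs that contain a stop word.

    Hypothetical helper for illustration; the project's own pipeline may differ.
    """
    return [
        sep.join(pair)
        for pair in zip(tokens, tokens[1:])
        if not any(token in stop_words for token in pair)
    ]


# Illustrative input; the stop-word set is an assumption.
tokens = ['i', 'need', 'a', 'new', 'ring', 'and', 'manual', 'transmission']
stop_words = {'i', 'need', 'a', 'and'}
pairs = bi_grams_sketch(tokens, stop_words)
print(pairs)
```

Only pairs where both tokens survive the stop-word filter are kept, which is why the bi-gram list is much shorter than the token list.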

Explore Non-Text Columns¶

Impurity¶

In [12]:
ax = df['impurity'].plot(kind='box', vert=False, figsize=(10, 1))
ax.set_title("Distribution of Post Impurity")
ax.set_xlabel("Impurity")
ax.set_yticklabels([])
ax;
In [13]:
df[['impurity', 'post', 'post_clean']].sort_values('impurity', ascending=False).head()
Out[13]:
impurity post post_clean
4684 0.18 I'm looking to lease an a4 premium plus automatic with the nav package.<lb><lb>Vehicle Price:<tab><tab>$49,150.00<tab> <lb> <tab>AutoNation Savings:<tab>-<tab>$3,867.00<tab> <lb> <tab>AutoNation Price:<tab><tab>$45,283.00<tab> <lb> <tab> <tab> <lb> <tab>Sales Tax (estimate):<tab>+<tab>$2,734.98<tab> <lb> <tab>Title Fee:<tab>+<tab>$100.00<tab> <lb> <tab>Tire/Battery/MVWEA:<tab>+<tab>$4.00<tab> <lb> <tab>Tag/Registration Fees (estimate):<tab>+<tab>$207.00<tab> <lb> <tab>Electronic Filing:<tab>+<tab>$20.00<tab> <lb> <tab>Other:<tab>+<tab>$20.00<tab> <lb> <tab>Documentation Fee:<tab>+<tab>$300.00<tab> <lb> <tab>Balance Due (estimate):<tab><tab>$48,668.98<tab> <tab>No Trade-In<lb><lb>LEASE OPTIONS<lb>Cash Due<tab>36 months <tab>42 months <lb><lb>$2,000 <tab>$723<tab>$690<lb>$4,000 <tab>$663<tab>$639<lb>$6,000 <tab>$603<tab>$587<lb><lb><lb>This is my first lease, do these numbers look good? Should I push back or negotiate on anything?<lb><lb>Thanks! I'm looking to lease an a4 premium plus automatic with the nav package. Vehicle Price: $ _NUMBER_ AutoNation Savings: $ _NUMBER_ AutoNation Price: $ _NUMBER_ Sales Tax (estimate): $ _NUMBER_ Title Fee: $ _NUMBER_ Tire/Battery/MVWEA: $ _NUMBER_ Tag/Registration Fees (estimate): $ _NUMBER_ Electronic Filing: $ _NUMBER_ Other: $ _NUMBER_ Documentation Fee: $ _NUMBER_ Balance Due (estimate): $ _NUMBER_ No Trade-In LEASE OPTIONS Cash Due _NUMBER_ months _NUMBER_ months $ _NUMBER_ $ _NUMBER_ $ _NUMBER_ $ _NUMBER_ $ _NUMBER_ $ _NUMBER_ $ _NUMBER_ $ _NUMBER_ $ _NUMBER_ This is my first lease, do these numbers look good? Should I push back or negotiate on anything? Thanks!
1287 0.17 Bulbs Needed:<lb><lb><lb>**194 LED BULB x8**<lb><lb>4- DOORS<lb><lb>2- MAP LIGHTS<lb><lb>2- VANITY<lb><lb><lb>**3022 LED BULB x3**<lb><lb>2- CARGO DOOR<lb><lb>1- DOME LIGHT<lb><lb><lb>**BULBS USED:**<lb><lb>[194 LED BULBS](https://goo.gl/Jfu2Dx)<lb><lb>[3022 LED BULBS](https://goo.gl/fPgk6n)<lb><lb>[Trim Tools](https://goo.gl/hjxw8Z)<lb><lb>Parts list courtesy of [The Blue TRD](https://www.youtube.com/watch?v=CBJxfWdbEfo&amp;t=28s) from his You Tube Channel.<lb><lb>Just passing along the helpful info. Bulbs Needed: ** _NUMBER_ LED BULB x8** _NUMBER_ - DOORS _NUMBER_ - MAP LIGHTS _NUMBER_ - VANITY ** _NUMBER_ LED BULB x3** _NUMBER_ - CARGO DOOR _NUMBER_ - DOME LIGHT **BULBS USED:** _NUMBER_ LED BULBS _NUMBER_ LED BULBS Trim Tools Parts list courtesy of The Blue TRD from his You Tube Channel. Just passing along the helpful info.
142 0.15 Breakdown below:<lb><lb>Elantra GT<lb><lb>2.0L 4-cylinder<lb><lb>6-speed Manual Transmission<lb><lb>$19,350<lb><lb>Elantra GT<lb><lb>2.0L 4-cylinder<lb><lb>6-speed Automatic Transmission w/ SHIFTRONIC®<lb><lb>$20,350<lb><lb>Elantra GT Sport<lb><lb>1.6L Turbo GDI 4-cylinder<lb><lb>6-speed Manual Transmission<lb><lb>$23,250<lb><lb>Elantra GT Sport<lb><lb>1.6L Turbo GDI 4-cylinder<lb><lb>7-speed EcoShift® Dual Clutch Transmission w/ SHIFTRONIC®<lb><lb>$24,350 Breakdown below: Elantra GT _NUMBER_ .0L _NUMBER_ -cylinder _NUMBER_ -speed Manual Transmission $ _NUMBER_ Elantra GT _NUMBER_ .0L _NUMBER_ -cylinder _NUMBER_ -speed Automatic Transmission w/ SHIFTRONIC® $ _NUMBER_ Elantra GT Sport _NUMBER_ .6L Turbo GDI _NUMBER_ -cylinder _NUMBER_ -speed Manual Transmission $ _NUMBER_ Elantra GT Sport _NUMBER_ .6L Turbo GDI _NUMBER_ -cylinder _NUMBER_ -speed EcoShift® Dual Clutch Transmission w/ SHIFTRONIC® $ _NUMBER_
3174 0.13 E-price:<lb>$20,863.00<lb>Freight:<lb>$900.00<lb>Processing Fee:<lb>$299.00<lb>Total before tax and tag fees:<lb>$22,062.00<lb>7% State SALES TAX: $ 1,544.34<lb>2 YEAR TAG FEES: $187.00<lb>TITLE: $100.00<lb>REGISTRATION: $20.00<lb>LIEN: $20.00<lb>INSPECTION: $25.00 <lb>State TEMP TAG: $20.00<lb>State TIRE FEE: $4.00<lb>TOTAL OUT THE DOOR YOU REQUESTED: $ 23,982.34<lb> E-price: $ _NUMBER_ Freight: $ _NUMBER_ Processing Fee: $ _NUMBER_ Total before tax and tag fees: $ _NUMBER_ _NUMBER_ % State SALES TAX: $ _NUMBER_ _NUMBER_ YEAR TAG FEES: $ _NUMBER_ TITLE: $ _NUMBER_ REGISTRATION: $ _NUMBER_ LIEN: $ _NUMBER_ INSPECTION: $ _NUMBER_ State TEMP TAG: $ _NUMBER_ State TIRE FEE: $ _NUMBER_ TOTAL OUT THE DOOR YOU REQUESTED: $ _NUMBER_
3678 0.12 The lease on my 2014 C250 is ending and I have the option to buy it for $20k.<lb>So I decided to see what else is out there for about the same amount of money and so far this are my options:<lb> <lb> <lb>Model: C250<lb>Year: 2014<lb>Miles: 7k<lb>Price: $20k<lb> <lb>Model: E350<lb>Year: 2010<lb>Miles: 59k<lb>Price: $19k<lb> <lb>Model: ML350<lb>Year: 2006<lb>Miles: 42k<lb>Price: $13k<lb> <lb>Thoughts? The lease on my _NUMBER_ C250 is ending and I have the option to buy it for $20k. So I decided to see what else is out there for about the same amount of money and so far this are my options: Model: C250 Year: _NUMBER_ Miles: 7k Price: $20k Model: E350 Year: _NUMBER_ Miles: 59k Price: $19k Model: ML350 Year: _NUMBER_ Miles: 42k Price: $13k Thoughts?
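
The `impurity` column is precomputed and the project's `impurity` function is not shown here. The posts above score high because they are full of markup-like characters such as `<lb>` and `<tab>`. A minimal sketch in the spirit of the cited book, assuming impurity is the share of "suspicious" characters in the text; the character set, the `min_length` cutoff, and the name `impurity_score` are assumptions for illustration:

```python
import re

# Assumed set of "suspicious" characters: typical markup and escape debris.
SUSPICIOUS = re.compile(r'[&#<>{}\[\]\\]')


def impurity_score(text, min_length=10):
    """Return the fraction of suspicious characters in `text`.

    Very short (or missing) texts get 0.0 so they do not dominate the ranking.
    """
    if text is None or len(text) < min_length:
        return 0.0
    return len(SUSPICIOUS.findall(text)) / len(text)


sample = "Bulbs Needed:<lb><lb>**194 LED BULB x8**"
score = impurity_score(sample)
print(round(score, 2))
```

Sorting by such a score, as in the cell above, is a quick way to find posts whose formatting survived cleaning poorly.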
In [14]:
df['language'].value_counts(ascending=False)
Out[14]:
English    4996
Name: language, dtype: int64
In [15]:
df['subreddit'].value_counts(ascending=False)
Out[15]:
Lexus                 266
Hyundai               263
Trucks                262
Honda                 261
MPSelectMiniOwners    260
mercedes_benz         259
mazda3                257
Harley                255
volt                  252
Volkswagen            252
Audi                  252
teslamotors           250
Volvo                 249
Mustang               248
BMW                   239
saab                  239
4Runner               238
Porsche               236
subaru                233
Wrangler              229
Name: subreddit, dtype: int64

Explore idiosyncrasies of the various columns, e.g. the same speaker represented in multiple ways.


Explore Text Column¶

Top Words Used¶

In [16]:
remove_tokens = {'_number_', 'car'}
count_tokens(df['partial_lemmas'], remove_tokens=remove_tokens).head(10)
Out[16]:
frequency
token
look 2776
like 2355
drive 1880
know 1812
new 1738
want 1687
buy 1556
thank 1497
work 1467
think 1459
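
`count_tokens` comes from the project's own library and its implementation is not shown. Conceptually, it counts how often each token occurs across all posts after dropping the tokens in `remove_tokens`. A rough standard-library sketch of that idea, where `count_tokens_sketch` is a hypothetical stand-in and not the project's function:

```python
from collections import Counter


def count_tokens_sketch(token_lists, remove_tokens=frozenset()):
    """Count token occurrences across documents, ignoring `remove_tokens`."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(token for token in tokens if token not in remove_tokens)
    return counter


# Illustrative documents, each already tokenized and lemmatized.
documents = [
    ['look', 'car', 'drive'],
    ['look', 'new', '_number_'],
    ['look', 'drive'],
]
frequencies = count_tokens_sketch(documents, remove_tokens={'_number_', 'car'})
top_tokens = frequencies.most_common(2)
print(top_tokens)
```

The real function returns a DataFrame indexed by `token` with a `frequency` column; the `Counter` here carries the same information.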

Distribution of Text Length¶

In [17]:
ax = df['post_length'].plot(kind='box', vert=False, figsize=(10, 1))
ax.set_title("Distribution of Post Length")
ax.set_xlabel("# of Characters")
ax.set_yticklabels([])
ax;
In [18]:
ax = df['post_length'].plot(kind='hist', bins=60, figsize=(10, 2))
ax.set_title("Distribution of Post Length")
ax.set_xlabel("# of Characters")
ax;
In [19]:
import seaborn as sns
sns.displot(df['post_length'], bins=60, kde=True, height=3, aspect=3);
In [20]:
where = df['subreddit'].isin([
    'Lexus', 
    'mercedes_benz',
    'Audi',
    'Volvo',
    'BMW',
])
g = sns.catplot(data=df[where], x="subreddit", y="post_length", kind='box')
g.fig.set_size_inches(6, 3)
g.fig.set_dpi(100)
g = sns.catplot(data=df[where], x="subreddit", y="post_length", kind='violin')
g.fig.set_size_inches(6, 3)
g.fig.set_dpi(100)

Word Frequency¶

In [21]:
counts_df = count_tokens(df['partial_lemmas'], remove_tokens=remove_tokens)
In [22]:
def plot_wordcloud(frequency_dict):
    """Plot a word cloud from a {token: weight} dict (e.g. frequencies or TF-IDF values)."""
    wc = wordcloud.WordCloud(
        background_color='white',
        colormap='tab20b',
        width=round(hlpp.STANDARD_WIDTH * 100),
        height=round(hlpp.STANDARD_HEIGHT * 100),
        max_words=200,
        max_font_size=150,
        random_state=42,  # fixed seed so the layout is reproducible
    )
    wc.generate_from_frequencies(frequency_dict)

    fig, ax = plt.subplots(figsize=(hlpp.STANDARD_WIDTH, hlpp.STANDARD_HEIGHT))
    ax.imshow(wc, interpolation='bilinear')
    ax.axis('off')
In [23]:
plot_wordcloud(counts_df.to_dict()['frequency']);

TF-IDF¶

In [24]:
tf_idf_lemmas = tf_idf(
    df=df,
    tokens_column='partial_lemmas',
    segment_columns = None,
    min_frequency_corpus=20,
    min_frequency_document=20,
    remove_tokens=remove_tokens,
)
tf_idf_lemmas.head()
Out[24]:
frequency tf-idf
token
look 2776 3043.65
drive 1880 2867.85
like 2355 2830.75
new 1738 2577.63
mile 1406 2510.05
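
The project's `tf_idf` function is not shown, so the exact formula and the meaning of `min_frequency_corpus` and `min_frequency_document` are not confirmed here. A minimal sketch, assuming the corpus-level formulation used in the cited book: the total term frequency multiplied by `log(N / document_frequency) + 0.1`. The name `tf_idf_sketch` and its single threshold parameter are assumptions for illustration:

```python
import math
from collections import Counter


def tf_idf_sketch(token_lists, min_frequency_document=1):
    """Return {token: corpus term frequency * smoothed IDF}.

    Assumed formula: idf = log(N / document_frequency) + 0.1, so tokens that
    appear in every document still receive a small positive weight.
    """
    token_lists = list(token_lists)
    num_documents = len(token_lists)
    term_frequency = Counter()
    document_frequency = Counter()
    for tokens in token_lists:
        term_frequency.update(tokens)           # every occurrence counts
        document_frequency.update(set(tokens))  # each document counts once

    scores = {}
    for token, frequency in term_frequency.items():
        if document_frequency[token] < min_frequency_document:
            continue  # skip tokens that appear in too few documents
        idf = math.log(num_documents / document_frequency[token]) + 0.1
        scores[token] = frequency * idf
    return scores


documents = [['oil', 'change', 'oil'], ['oil', 'leak'], ['brake', 'noise']]
scores = tf_idf_sketch(documents)
print(sorted(scores, key=scores.get, reverse=True)[:3])
```

Under this assumption, a token's score grows with how often it is used overall and shrinks as it spreads across more documents, which is why the TF-IDF ranking above differs from the raw frequency ranking.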
In [25]:
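# NOTE: bi-gram tokens are joined with '-' (e.g. '_number_-year'), so these
# space-separated entries match nothing; '_number_-year' and
# '_number_-_number_' still appear in the output below.
remove_tokens_bi_grams = {'_number_ year', '_number_ _number_', 'hey guy'}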
tf_idf_bi_grams = tf_idf(
    df=df,
    tokens_column='bi_grams',
    segment_columns = None,
    min_frequency_corpus=20,
    min_frequency_document=20,
    remove_tokens=remove_tokens_bi_grams,
)
tf_idf_bi_grams.head()
Out[25]:
frequency tf-idf
token
$-_number_ 1089 2366.46
_number_-mile 553 1417.20
_number_-year 397 1123.63
_number_-_number_ 285 890.38
look-like 229 749.94
In [26]:
tf_idf_nouns = tf_idf(
    df=df,
    tokens_column='nouns',
    segment_columns = None,
    min_frequency_corpus=20,
    min_frequency_document=20,
    remove_tokens=remove_tokens,
)
tf_idf_nouns.head()
Out[26]:
frequency tf-idf
token
mile 1404 2507.99
issue 1142 2240.32
year 1227 2226.76
time 1289 2216.42
engine 944 2100.66
In [27]:
tf_idf_noun_phrases = tf_idf(
    df=df,
    tokens_column='noun_phrases',
    segment_columns = None,
    min_frequency_corpus=20,
    min_frequency_document=20,
    remove_tokens=remove_tokens_bi_grams,
)
tf_idf_noun_phrases.head()
Out[27]:
frequency tf-idf
token
oil-change 132 514.63
new-car 130 502.27
test-drive 110 442.43
engine-light 101 439.81
year-old 105 426.65

In [28]:
ax = tf_idf_lemmas.head(30)[['tf-idf']].plot(kind='barh', width=0.99)
ax.set_title("TF-IDF of Uni-Grams")
ax.set_xlabel("TF-IDF")
ax.invert_yaxis();
In [29]:
ax = tf_idf_bi_grams.head(30)[['tf-idf']].plot(kind='barh', width=0.99)
ax.set_title("TF-IDF of Bi-Grams")
ax.set_xlabel("TF-IDF")
ax.invert_yaxis();
In [30]:
ax = tf_idf_nouns.head(30)[['tf-idf']].plot(kind='barh', width=0.99)
ax.set_title("TF-IDF of Nouns")
ax.set_xlabel("TF-IDF")
ax.invert_yaxis();
In [31]:
ax = tf_idf_noun_phrases.head(30)[['tf-idf']].plot(kind='barh', width=0.99)
ax.set_title("TF-IDF of Noun Phrases")
ax.set_xlabel("TF-IDF")
ax.invert_yaxis();
In [32]:
plot_wordcloud(tf_idf_lemmas.to_dict()['tf-idf']);
In [33]:
plot_wordcloud(tf_idf_bi_grams.to_dict()['tf-idf']);

By Subreddit¶

In [34]:
remove_tokens_subreddit = set(df.subreddit.str.lower().unique())
remove_tokens_subreddit
Out[34]:
{'4runner',
 'audi',
 'bmw',
 'harley',
 'honda',
 'hyundai',
 'lexus',
 'mazda3',
 'mercedes_benz',
 'mpselectminiowners',
 'mustang',
 'porsche',
 'saab',
 'subaru',
 'teslamotors',
 'trucks',
 'volkswagen',
 'volt',
 'volvo',
 'wrangler'}
In [35]:
tf_idf_lemmas_per_sub = tf_idf(
    df=df,
    tokens_column='partial_lemmas',
    segment_columns = 'subreddit',
    min_frequency_corpus=10,
    min_frequency_document=10,
    remove_tokens=remove_tokens | remove_tokens_subreddit 
)
tf_idf_lemmas_per_sub.head(5)
Out[35]:
frequency tf-idf
subreddit token
4Runner gen 74 283.40
sr5 50 241.65
lift 57 215.97
rear 61 171.54
look 148 162.27
In [36]:
tf_idf_bigrams_per_sub = tf_idf(
    df=df,
    tokens_column='bi_grams',
    segment_columns = 'subreddit',
    min_frequency_corpus=10,
    min_frequency_document=10,
    remove_tokens=remove_tokens_bi_grams
)
tf_idf_bigrams_per_sub.head(5)
Out[36]:
frequency tf-idf
subreddit token
4Runner _number_-4runner 41 201.05
3rd-gen 27 138.26
$-_number_ 60 130.38
_number_-sr5 23 129.29
4th-gen 15 90.78
In [37]:
tf_idf_nouns_per_sub = tf_idf(
    df=df,
    tokens_column='nouns',
    segment_columns = 'subreddit',
    min_frequency_corpus=10,
    min_frequency_document=10,
    remove_tokens=remove_tokens | remove_tokens_subreddit
)
tf_idf_nouns_per_sub.head(5)
Out[37]:
frequency tf-idf
subreddit token
4Runner gen 74 283.40
sr5 50 241.65
lift 42 178.41
mile 82 146.48
trd 26 146.16
In [38]:
tf_idf_nounphrases_per_sub = tf_idf(
    df=df,
    tokens_column='noun_phrases',
    segment_columns = 'subreddit',
    min_frequency_corpus=10,
    min_frequency_document=10,
    remove_tokens=remove_tokens_bi_grams
)
tf_idf_nounphrases_per_sub.head(5)
Out[38]:
frequency tf-idf
subreddit token
4Runner sway-bar 12 65.78
check-engine 10 44.13
Harley new-bike 14 84.73
spark-plug 10 46.85
Honda oil-change 14 54.58

In [39]:
tokens_to_show = tf_idf_lemmas_per_sub.query("subreddit in ['Lexus', 'Volvo']").reset_index()
tokens_to_show.head()
Out[39]:
subreddit token frequency tf-idf
0 Lexus is350 37 198.29
1 Lexus look 166 182.01
2 Lexus mile 101 180.31
3 Lexus gs 30 160.77
4 Lexus drive 103 157.12
In [40]:
px.bar(
    tokens_to_show.groupby(['subreddit']).head(20).sort_values('tf-idf', ascending=True),
    x='tf-idf',
    y='token',
    color='subreddit',
    barmode='group',
    title="Top 20 Lemmas for Volvo & Lexus"
)
In [41]:
tokens_to_show = tf_idf_bigrams_per_sub.query("subreddit in ['Lexus', 'Volvo']").reset_index()
tokens_to_show.head()
Out[41]:
subreddit token frequency tf-idf
0 Lexus $-_number_ 54 117.35
1 Lexus _number_-lexus 19 105.00
2 Lexus f-sport 16 94.55
3 Lexus _number_-is250 13 82.09
4 Lexus es-_number_ 13 80.85
In [42]:
px.bar(
    tokens_to_show.groupby(['subreddit']).head(20).sort_values('tf-idf', ascending=True),
    x='tf-idf',
    y='token',
    color='subreddit',
    barmode='group',
    title="Top 20 Bi-Grams for Volvo & Lexus"
)

In [43]:
get_context_from_keyword(df.query("subreddit == 'Lexus'")['post'], keyword='think')
Out[43]:
3870     cheap and easy fix or do you guys  |think|  an insurance claim will end up nee
2365           Hi There,<lb><lb>I'm really  |think| ing about an RC 350 but I have one 
1155    need to buy to do this myself? I'm  |think| ing I need the following:<lb><lb>- 
3587    e 2012 GS250 vs the 2014 IS250?  I  |think|  it's the same engine the GS250 wil
4994    n blowing the AM2 fuse and the guy  |think| s it's something on the ignition re
3897    n driving them for a while and was  |think| ing about trying something differen
1474    es' of information (voltage aside)  |think| s that one of the sensors must be b
2827    ead of just replacing the bulb I'm  |think| ing of just getting some new headli
3952    ..<lb><lb>But what else? I am just  |think| ing if they are about the same pric
490                        Hey guys so I'm  |think| ing about getting a CPO by the end 
dtype: object
In [44]:
get_context_from_keyword(df.query("subreddit == 'Volvo'")['post'], keyword='think')
Out[44]:
1194    wrench even came out, and I didn't  |think|  they were that tight. This can't b
3411     the GLE model.  Anyways I've been  |think| ing about purchasing this Volvo 740
985      2018 s90 - and a couple of what I  |think|  are glichy things are coming up 2 
2134    etween gears while moving. I don't  |think|  the clutch is fully disengaging fr
1089    NII<lb><lb>Worst case scenario I'm  |think| ing the car could have been in an a
14      ience are the any other issues you  |think|  I should address to avoid future p
14       already said, it runs great and I  |think|  it has a lot of miles left on it i
2902                       Hi guys!<lb>I'm  |think| ing about having my front seats dee
4099    mes on.  I try to stay positive by  |think| ing maybe my mechanic didn't top me
1314    d reviews, so I don't know what to  |think| . In fact, I don't even know what t
dtype: object

Lexus¶

In [45]:
tokens_to_show = tf_idf_lemmas_per_sub.query("subreddit == 'Lexus'").reset_index()
#tokens_to_show = tokens_to_show[~tokens_to_show.token.isin(stop_words)]
tokens_to_show = tokens_to_show[['token', 'tf-idf']].set_index('token')
tokens_to_show = tokens_to_show.to_dict()['tf-idf']
plot_wordcloud(tokens_to_show);

Volvo¶

In [46]:
tokens_to_show = tf_idf_lemmas_per_sub.query("subreddit == 'Volvo'").reset_index()
#tokens_to_show = tokens_to_show[~tokens_to_show.token.isin(stop_words)]
tokens_to_show = tokens_to_show[['token', 'tf-idf']].set_index('token')
tokens_to_show = tokens_to_show.to_dict()['tf-idf']
plot_wordcloud(tokens_to_show);

Keywords in Context¶

In [47]:
contexts = get_context_from_keyword(
    documents=df['post'],
    window_width=50,
    keyword='replac',
    num_samples = 20,
    random_seed=42
)
for x in contexts:
    print(x)
rger.<lb>* My Elantra still runs great and I just  |replac| ed the tires.<lb>* The financials are all there, b
 for a while, but now the time has come for me to  |replac| e my coil springs. I had heard great things about 
I'm  |replac| ing the radio with a Pioneer AVH-X2800BS in my '04
<lb><lb>I sent it back to PA Performance and they  |replac| ed all the internals. While I was waiting on PA Pe
 Also, the battery in the car seems like it needs  |replac| ing as the interior lights flicker and the car has
ere can confirm that before I go digging deep for  |replac| ement panels. Thanks!
and my steering stabilizer is shot.  Do I need to  |replac| e it to keep my tires healthy or not?  I really do
nyone knows the type of them? Or where can i find  |replac| ements? I will add pictures, the small one is from
 is having some problems so that might need to be  |replac| ed. Could that explain the floatiness? <lb><lb>Als
<lb>The coolant reservoir tank seems to have been  |replac| ed, I heard it's a common problem on these.
Anyone else experience this? Dealer quote to  |replac| e it was insane for what I imagine is a relatively
I know the gasket needs to be  |replac| ed and the mechanic i take mine too agreed to let 
I also asked Mazda service dealers for a quote in  |replac| ing this part. The quotes varied wildly with some 
 it not for corrosion around the old part.<lb><lb> |Replac| ing the vehicle speed sensor, or driven-speed gear
lb>My boss has a 2010 Dodge Ram 1500 and wants to  |replac| e his old, corroded bars with some new ones. Thing
 the things that I am seeing is pretty much about  |replac| ing the entire intake manifold. Also seeing a lot 
My fault, I fucked up  |replac| ing the door lock actuator and now my rear passeng
 may bite the bullet and buy a used door panel to  |replac| e it with unless you guys have any suggestions. 
gs, they were original, appear to have aged well.  |Replac| ed the plugs, and switched the coil pack from 1-2 
 every now and then it simply won't start. I have  |replac| ed and checked the battery, but I'm thinking I may
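
`get_context_from_keyword` is a project function whose implementation is not shown. The output above is a keyword-in-context (KWIC) listing: each match of the keyword is printed with a fixed-width window of text on either side, and matching on the stem `replac` catches "replaced", "replacing", and "replacement". A minimal regex-based sketch of the idea, where `keyword_in_context` is a hypothetical helper that omits the sampling controlled by `num_samples` and `random_seed`:

```python
import re


def keyword_in_context(text, keyword, window_width=35):
    """Return each case-insensitive match of `keyword` with its surrounding text."""
    pattern = re.compile(re.escape(keyword), flags=re.IGNORECASE)
    contexts = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - window_width):match.start()]
        right = text[match.end():match.end() + window_width]
        contexts.append(f"{left} |{match.group(0)}| {right}")
    return contexts


post = "I just replaced the tires. Replacing the battery is next."
lines = keyword_in_context(post, 'replac', window_width=10)
for line in lines:
    print(line)
```

Because the keyword is matched as a plain substring, a short stem works as a cheap substitute for lemmatization when scanning raw posts.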